Main Questions
Question 1 & 2
df %>%
filter(Year == 1962) %>%
ggplot(aes(y = co2PerCap, x = gdpPercap)) +
theme_classic() +
geom_point(color = "red") +
labs(y = "CO2 emissions (metric tons per capita)", x = "GDP in purchasing power parity (USD per capita)") +
ggtitle("GDP vs. CO2 emissions in 1962")

df %>%
filter(Year == 1962) %>%
ggplot(aes(y = co2PerCap, x = gdpPercap)) +
theme_classic() +
scale_y_log10() +
scale_x_log10() +
geom_point(color = "red") +
ggtitle("log GDP vs. log CO2 emissions in 1962") +
xlab("log GDP in purchasing power parity (USD per capita)") +
ylab("log CO2 emissions (metric tons per capita)")
After visualizing the original data, we see a few very large values
lying far from the bulk of the observations, which are small and
clustered close together. It appears that as GDP per capita increases,
CO2 emissions increase at a faster rate, up to a GDP per capita of about
200,000. We cannot determine whether the relationship between the x and
y values is linear just by inspecting this plot.
Because both the x and y values span a wide range of magnitudes, we
log-transform both x (GDP) and y (CO2).
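The reasoning behind the log transform can be illustrated with a small sketch. The data below are synthetic (not the report's dataset), and the power-law exponent of 0.9 is an arbitrary assumption chosen for illustration. If y is roughly proportional to a power of x, the relationship is a straight line on the log-log scale, and the fitted slope estimates the exponent.

```r
# Synthetic data (assumption: y follows a power law in x with multiplicative noise)
set.seed(1)
x <- 10^runif(200, min = 2, max = 5)           # spans three orders of magnitude
y <- 0.001 * x^0.9 * 10^rnorm(200, sd = 0.1)

# Pearson correlation on the raw scale and on the log10 scale
r_raw <- cor(x, y)
r_log <- cor(log10(x), log10(y))

# On the log-log scale the relationship is linear; the slope estimates the exponent
fit <- lm(log10(y) ~ log10(x))
slope <- coef(fit)[["log10(x)"]]
print(c(r_raw = r_raw, r_log = r_log, slope = slope))
```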
Question 3
df <- df %>%
mutate(logCO2 = log10(co2PerCap), logGDP = log10(gdpPercap))
mod <- cor.test(x = df$logCO2, y = df$logGDP) %>% tidy()
mod
## # A tibble: 1 × 8
## estimate statistic p.value parameter conf.low conf.high method alternative
## <dbl> <dbl> <dbl> <int> <dbl> <dbl> <chr> <chr>
## 1 0.902 71.8 0 1182 0.891 0.912 Pearson's… two.sided
Pearson’s correlation coefficient measures the strength and direction of
the linear relationship between two variables. Log GDP is strongly and
positively associated with log CO2 (r = 0.90, 95% CI 0.89 to 0.91).
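One property worth keeping in mind when reading this coefficient: Pearson's r is unchanged by positive linear rescaling of either variable, so the base of the logarithm does not affect the estimate. The sketch below checks this on synthetic data; the variable names `gdp` and `co2` are illustrative stand-ins, not the report's columns.

```r
# Synthetic stand-ins for GDP and CO2 per capita (assumed, for illustration only)
set.seed(2)
gdp <- 10^runif(100, min = 2, max = 5)
co2 <- 0.001 * gdp * 10^rnorm(100, sd = 0.2)

# log() and log10() differ only by the constant factor log(10),
# so the two correlations agree up to floating-point error
r_mixed  <- cor(log10(co2), log(gdp))     # natural log on one side
r_common <- cor(log10(co2), log10(gdp))   # base 10 on both sides
print(all.equal(r_mixed, r_common))
```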
Question 4
res <- df %>%
group_by(Year) %>%
summarise(
tidy(
cor.test(x = co2PerCap, y = gdpPercap, method = "kendall")
)
) %>%
dplyr::slice_max(estimate, n = 1)
res
## # A tibble: 1 × 6
## Year estimate statistic p.value method alternative
## <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 2002 0.780 12.9 4.37e-38 Kendall's rank correlation tau two.sided
Kendall’s tau correlation between CO2 emissions and GDP per capita is
highest in 2002, at tau = 0.78.
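The per-year computation can also be sketched in base R without `group_by()`. The data frame `toy` below is synthetic, with column names chosen to mirror the report's; the noise levels per year are assumptions made so that the association differs between years.

```r
# Synthetic panel: three years, 30 observations each (illustrative only)
set.seed(3)
toy <- data.frame(
  Year      = rep(c(1962, 1982, 2002), each = 30),
  gdpPercap = 10^runif(90, min = 2, max = 5)
)
# Assumed noise level per year: smaller noise gives a stronger association
noise_sd <- c(1.0, 0.5, 0.1)[match(toy$Year, c(1962, 1982, 2002))]
toy$co2PerCap <- 0.001 * toy$gdpPercap * 10^rnorm(90, sd = noise_sd)

# One Kendall's tau per year, then the year with the largest tau
tau_by_year <- sapply(split(toy, toy$Year), function(d) {
  cor(d$co2PerCap, d$gdpPercap, method = "kendall")
})
best_year <- names(tau_by_year)[which.max(tau_by_year)]
print(tau_by_year)
print(best_year)
```

Because Kendall's tau depends only on ranks, it gives the same value on the raw and the log-transformed variables, which is why the raw columns are used here.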
Question 5
fig <- df %>%
filter(Year == res$Year) %>%
plot_ly(
x = ~logGDP,
y = ~logCO2,
size = ~pop,
color = ~continent,
# frame = ~Year,
text = ~`Country Name`,
hoverinfo = "text",
type = "scatter",
mode = "markers"
)
# logGDP is already log-transformed, so the x-axis stays on a linear scale
fig <- fig %>% animation_opts(
1000,
easing = "elastic", redraw = FALSE
)
fig %>%
layout(
title = "log GDP vs. log CO2 emissions in 2002", plot_bgcolor = "#e5ecf6",
# x is logGDP and y is logCO2, so the axis titles follow that mapping
xaxis = list(title = "log GDP"),
yaxis = list(title = "log CO2 emissions"), legend = list(title = list(text = "<b> Continent </b>"))
)
The interactive plot above depicts the relationship between CO2
emissions and GDP per capita in the year (2002) where the correlation
between the two variables is the highest as demonstrated in the question
above. Hovering over the dots displays the country names, and the dot
sizes correspond to the population size of that country.
More Questions
Question 1
What is the relationship between continent and ‘Energy use (kg
of oil equivalent per capita)’?
res <- df %>%
filter(!is.na(continent)) %>%
kruskal.test(energyUsePerCap ~ continent, data = .) %>%
tidy()
We use the Kruskal-Wallis test because it is a non-parametric
alternative to one-way ANOVA. It does not assume normally distributed
residuals. The test works on two or more independent samples, which may
have different sizes.
There is a significant relationship between continent and energy use:
the p-value is very close to 0, well below the significance threshold,
which we set at 0.05.
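A minimal sketch of the test's formula interface on synthetic data follows. The group sizes and log-normal parameters are assumptions for illustration, not values from the dataset. Because the test is rank-based, a strictly monotone transform of the response (such as a log) leaves the statistic unchanged.

```r
# Synthetic groups of unequal size (illustrative only)
set.seed(4)
toy <- data.frame(
  continent = rep(c("Africa", "Asia", "Europe"), times = c(20, 25, 30)),
  energyUsePerCap = c(
    rlnorm(20, meanlog = 6),
    rlnorm(25, meanlog = 7),
    rlnorm(30, meanlog = 8)
  )
)

# Formula interface: response ~ grouping variable
kw <- kruskal.test(energyUsePerCap ~ continent, data = toy)
print(kw$p.value)

# The statistic depends only on ranks, so log-transforming the response changes nothing
kw_log <- kruskal.test(log(energyUsePerCap) ~ continent, data = toy)
print(all.equal(unname(kw$statistic), unname(kw_log$statistic)))
```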
Question 2
Is there a significant difference between Europe and Asia with respect
to ‘Imports of goods and services (% of GDP)’ in the years after 1990?
mod <- df %>%
filter(continent %in% c("Asia", "Europe"), Year > 1990) %>%
glm(importPercentageGDP ~ continent, data = .) %>%
tidy()
mod
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 46.8 2.61 17.9 3.36e-44
## 2 continentEurope -5.06 3.56 -1.42 1.58e- 1
While there are many candidate statistical tests we could use to compare
the variable of interest between two groups, a simple linear regression
is chosen because it shows to what extent the regressor (continent)
affects the regressand (imports of goods and services as a % of GDP).
\[\begin{equation}
Y_i = \beta_0 + \beta_1 \, \mathrm{Continent}_i + \epsilon_i
\end{equation}\]
The null hypothesis is \(\beta_{1} = 0\), where the variable Continent = 1
for Europe and 0 for Asia.
We fit a linear regression model to compare the two groups. There is no
significant difference between Europe and Asia with respect to imports
of goods and services as a percentage of GDP (p = 0.158, above the 0.05
threshold).
A t-test would also have answered the question above; linear regression
has the additional advantage of telling us how much a change from
Asia = 0 to Europe = 1 affects the outcome variable (imports of goods
and services as a % of GDP). This is indicated by the beta weight,
-5.06: Europe's mean is about 5.06 percentage points lower than Asia's.
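The link between the two approaches can be checked directly. In the sketch below (synthetic data, with assumed group means and standard deviations), a regression on a two-level grouping variable gives the same p-value as a pooled-variance t-test, and the slope equals the difference in group means. Note that `t.test()` defaults to Welch's unequal-variance test, so `var.equal = TRUE` is needed for exact agreement.

```r
# Synthetic two-group data (illustrative only)
set.seed(5)
toy <- data.frame(
  continent = rep(c("Asia", "Europe"), times = c(40, 35)),
  importPercentageGDP = c(
    rnorm(40, mean = 45, sd = 20),
    rnorm(35, mean = 40, sd = 20)
  )
)

# Regression with a dummy regressor: Asia is the reference level
fit   <- lm(importPercentageGDP ~ continent, data = toy)
beta1 <- coef(fit)[["continentEurope"]]
p_lm  <- summary(fit)$coefficients["continentEurope", "Pr(>|t|)"]

# Pooled-variance two-sample t-test on the same data
p_tt <- t.test(importPercentageGDP ~ continent, data = toy, var.equal = TRUE)$p.value

# The slope is the Europe mean minus the Asia mean
means <- sapply(split(toy$importPercentageGDP, toy$continent), mean)
mean_diff <- means[["Europe"]] - means[["Asia"]]
print(c(beta1 = beta1, mean_diff = mean_diff, p_lm = p_lm, p_tt = p_tt))
```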
Question 3
What is the country (or countries) that has the highest ‘Population
density (people per sq. km of land area)’ across all years? (i.e., which
country has the highest average ranking in this category across each
time point in the dataset?)
df %>%
select(Year, `Country Name`, popDensityPerSqKm) %>%
arrange(Year, desc(popDensityPerSqKm)) %>%
group_by(Year) %>%
slice(1:3) %>%
ggplot(data = ., aes(x = as.factor(Year), y = popDensityPerSqKm, fill = as.factor(`Country Name`))) +
geom_bar(position = "dodge", stat = "identity") +
theme_classic() +
labs(x = "Year", y = "population density (per sq.km)", fill = "Country") +
ggtitle("Population density in the top 3 highest density countries in Years 1962-2007")

res <- df %>%
select(Year, `Country Name`, popDensityPerSqKm) %>%
arrange(Year, desc(popDensityPerSqKm)) %>%
group_by(Year) %>%
slice(1:3) %>%
mutate(
rnks = row_number(desc(popDensityPerSqKm))
) %>%
group_by(`Country Name`) %>%
summarize(mean.rank = mean(rnks))
res
## # A tibble: 4 × 2
## `Country Name` mean.rank
## <chr> <dbl>
## 1 Hong Kong SAR, China 3
## 2 Macao SAR, China 1.5
## 3 Monaco 1.5
## 4 Singapore 3
The highest-ranked country in terms of population density changes
across the years, as we can tell from the graph above.
To find out which country has the highest average ranking, we average
each country’s rank over the years in which it appears among the top
three by population density during the period 1962-2007. Macao SAR,
China and Monaco are tied for first place with an average rank of 1.5,
while Hong Kong SAR, China and Singapore each have an average rank of 3.
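An alternative reading of the question ranks every country in every year, not only the yearly top three, before averaging. A minimal base R sketch of that variant is shown below; the three countries and their densities are made up for illustration.

```r
# Made-up densities for three hypothetical countries over three years
toy <- data.frame(
  Year    = rep(c(1962, 1967, 1972), each = 3),
  Country = rep(c("A", "B", "C"), times = 3),
  popDensityPerSqKm = c(900, 800, 100,   # 1962: A > B > C
                        850, 950, 120,   # 1967: B > A > C
                        990, 940, 130)   # 1972: A > B > C
)

# Rank within each year; negating the density makes rank 1 the densest
toy$rnk <- ave(-toy$popDensityPerSqKm, toy$Year, FUN = rank)

# Average rank per country over all years: A = 4/3, B = 5/3, C = 3
mean_rank <- sapply(split(toy$rnk, toy$Country), mean)
top <- names(mean_rank)[mean_rank == min(mean_rank)]
print(mean_rank)
print(top)
```

Ranking all countries avoids the situation where a country that appears in the top three only once receives an average based on a single year.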
Question 4
What country (or countries) has shown the greatest increase in ‘Life
expectancy at birth, total (years)’ since 1962?
res <- df %>%
select(Year, `Country Name`, `Life expectancy at birth, total (years)`) %>%
group_by(`Country Name`) %>%
summarise(
diff = `Life expectancy at birth, total (years)`[Year == 2007] - `Life expectancy at birth, total (years)`[Year == 1962],
.groups = "drop"
) %>%
dplyr::slice_max(diff, n = 5)
res
## # A tibble: 5 × 2
## `Country Name` diff
## <chr> <dbl>
## 1 Maldives 36.9
## 2 Bhutan 33.2
## 3 Timor-Leste 31.1
## 4 Tunisia 30.9
## 5 Oman 30.8
res %>%
ggplot(aes(x = reorder(`Country Name`, -diff), y = diff)) +
geom_bar(position = "dodge", stat = "identity", fill = "lightblue") +
theme_classic() +
ggtitle("Increase in Life Expectancy in Years (Period: 1962-2007)") +
ylab("Years") +
xlab("Country") +
geom_text(aes(label = round(diff, 2)), position = position_dodge(width = 0.9), vjust = -0.25)
From the graph above, we see that the top 5 countries that have shown
the greatest increase in life expectancy are Maldives, Bhutan,
Timor-Leste, Tunisia, and Oman.
This answer is based on the absolute difference in life expectancy
between 2007 and 1962.
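The absolute-difference calculation can be sketched in base R as follows. The two countries and their values are hypothetical, and `lifeExp` is a shortened stand-in for the report's life expectancy column.

```r
# Hypothetical life expectancy values for two countries
toy <- data.frame(
  Country = rep(c("A", "B"), each = 2),
  Year    = rep(c(1962, 2007), times = 2),
  lifeExp = c(40, 60,    # A gains 20 years
              70, 85)    # B gains 15 years
)

# Absolute change between the first and last year, per country
change <- sapply(split(toy, toy$Country), function(d) {
  d$lifeExp[d$Year == 2007] - d$lifeExp[d$Year == 1962]
})
largest <- names(change)[which.max(change)]
print(change)
print(largest)
```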